SPLAT-2637: Promoting gate AWSServiceLBNetworkSecurityGroup to GA #2717

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from mtulio:SPLAT-2637-ga-aws-ccm-nlb-sg
Mar 10, 2026

Conversation

@mtulio
Contributor

@mtulio mtulio commented Feb 18, 2026

PR blocked by:

Is blocking GA on Hypershift: TPNU ref openshift/hypershift#7460

Feature OCPSTRAT-1553 / SPLAT-2553

The cloud-provider-aws-e2e-openshift tests have been added for this feature, along with cloud-provider-aws-e2e, exercising the CCM through OTE.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 18, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 18, 2026

@mtulio: This pull request references SPLAT-2637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

PR blocked by:

Is blocking GA on Hypershift: TPNU ref openshift/hypershift#7460

Feature OCPSTRAT-1553 / SPLAT-2553

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Feb 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between e188bac and 98fa4f3.

📒 Files selected for processing (4)
  • features.md
  • features/features.go
  • payload-manifests/featuregates/featureGate-4-10-SelfManagedHA-Default.yaml
  • payload-manifests/featuregates/featureGate-4-10-SelfManagedHA-OKD.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • features.md

📝 Walkthrough

The AWSServiceLBNetworkSecurityGroup feature gate was refactored to support cluster-profile-aware configuration. The feature gate declaration was updated to include two separate enable branches: one targeting SelfManaged profiles and another targeting Hypershift profiles. Correspondingly, AWSServiceLBNetworkSecurityGroup was transitioned from disabled to enabled status in multiple feature gate manifests, including the Default and OKD configurations. Documentation in features.md was updated to reflect these configuration changes.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically describes the main change: promoting the AWSServiceLBNetworkSecurityGroup feature gate to GA, which aligns with all file modifications.
Description check | ✅ Passed | The description is related to the changeset, providing context about the feature promotion, blocked dependencies, test coverage, and verification status.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names | ✅ Passed | PR does not modify any Ginkgo test files; changes are limited to feature gate configuration, documentation, and YAML manifests.
Test Structure And Quality | ✅ Passed | This PR does not introduce or modify any Ginkgo test code. The only test file uses the standard Go testing framework, not Ginkgo patterns.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented
The command is terminated due to an error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented

@mtulio
Contributor Author

mtulio commented Feb 18, 2026

/test ?

@qodo-code-review

qodo-code-review bot commented Feb 18, 2026

PR-Agent: could not fine a component named ? in a supported language in this PR.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 18, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 18, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Feb 18, 2026

Hello @mtulio! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 18, 2026
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2026
@mtulio mtulio force-pushed the SPLAT-2637-ga-aws-ccm-nlb-sg branch from d29a368 to 8a2811f Compare February 23, 2026 22:41
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2026
@mtulio
Contributor Author

mtulio commented Feb 23, 2026

/test all

@mtulio
Contributor Author

mtulio commented Feb 24, 2026

/test verify-feature-promotion
/test e2e-aws-ovn
/test e2e-aws-ovn-hypershift
/test e2e-aws-ovn-hypershift-conformance
/test e2e-aws-ovn-techpreview

@mtulio
Contributor Author

mtulio commented Feb 24, 2026

The e2e-aws-ovn-hypershift-conformance failure is legit. I just created the skip in openshift/release#75211 while we wait for the feature PR in Hypershift, openshift/hypershift#7460.

@mtulio
Contributor Author

mtulio commented Feb 24, 2026

verify-feature-promotion is also failing locally.

$ make verify-feature-promotion 
hack/verify-promoted-features-pass-tests.sh
comparing against master
...
Query sippy for all test run results for pattern "FeatureGate:AWSServiceLBNetworkSecurityGroup]" on variant main.JobVariant{Cloud:"aws", Architecture:"amd64", Topology:"single", NetworkStack:""}
Querying sippy release 4.22 for test run results
F0224 11:24:48.988679 2287144 root.go:64] Error running codegen: Get "https://sippy.dptools.openshift.org/api/tests?filter=%7....erator%22%3A%22and%22%7D&period=default&release=4.22":
 context deadline exceeded (Client.Timeout exceeded while awaiting headers)
make: *** [Makefile:95: verify-feature-promotion] Error 255
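
The failure above is a client timeout while talking to Sippy, not a test verdict, so it can be treated as a transient error and retried. A minimal Python sketch of retry with exponential backoff follows. It is an illustration only: `retry`, `fetch`, and the delay values are hypothetical names and numbers, and nothing here reflects how hack/verify-promoted-features-pass-tests.sh is actually implemented.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch until it succeeds or attempts run out, doubling the delay each time.

    fetch is a hypothetical stand-in for the Sippy query; only TimeoutError
    is treated as retryable in this sketch.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last timeout
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.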

@mtulio
Contributor Author

mtulio commented Mar 3, 2026

/test e2e-aws-ovn-techpreview

@qodo-code-review

qodo-code-review bot commented Mar 3, 2026

PR-Agent: could not fine a component named e2e-aws-ovn-techpreview in a supported language in this PR.

@mtulio
Contributor Author

mtulio commented Mar 3, 2026

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ccm-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2026

@mtulio: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ccm-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f8192380-1722-11f1-9762-ff925bc266de-0

@mtulio mtulio force-pushed the SPLAT-2637-ga-aws-ccm-nlb-sg branch from 8a2811f to a02121c Compare March 5, 2026 14:50
@openshift-ci openshift-ci bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 5, 2026
@mtulio
Contributor Author

mtulio commented Mar 5, 2026

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ccm-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@mtulio: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ccm-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/beb86480-18a2-11f1-8079-b2880f72d5dc-0

@mtulio
Contributor Author

mtulio commented Mar 5, 2026

/test all

@mtulio
Contributor Author

mtulio commented Mar 6, 2026

Based on the verify-feature-promotion results, I've updated this PR to focus on the HA profile, as that's the target for GA.

@mtulio
Contributor Author

mtulio commented Mar 6, 2026

Hey @JoelSpeed, would you mind reviewing this PR? I am waiting for the tests to populate Sippy; meanwhile, I am looking for feedback so I have time to fix anything that might not be ready (except those tests). Thanks in advance.

Holding this PR until the tests are populated for verify-feature-promotion:

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 6, 2026
@mtulio
Contributor Author

mtulio commented Mar 9, 2026

Just ran make verify-feature-promotion and got:

Query sippy for all test run results for feature gate "AWSServiceLBNetworkSecurityGroup" on clusterProfile ["SelfManagedHA"]
Query sippy for all test run results for pattern "FeatureGate:AWSServiceLBNetworkSecurityGroup]" on variant main.JobVariant{Cloud:"aws", Architecture:"amd64", Topology:"ha", NetworkStack:""}
Querying sippy release 4.22 for test run results
Query sippy for all test run results for pattern "FeatureGate:AWSServiceLBNetworkSecurityGroup]" on variant main.JobVariant{Cloud:"aws", Architecture:"amd64", Topology:"single", NetworkStack:""}
Querying sippy release 4.22 for test run results
Sufficient CI testing for "AWSServiceLBNetworkSecurityGroup".

I am addressing Joel's feedback on the HCP TP strategy in parallel with self-managed.

This change promotes the feature gate AWSServiceLBNetworkSecurityGroup
from the TechPreviewNoUpgrade to the Default feature set on self-managed.

The feature AWSServiceLBNetworkSecurityGroup is also kept in TP and DP on
Hypershift while the implementation there is finished.

The tests covering this feature are added by
openshift/cloud-provider-aws#129.
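
The enablement described above can be sketched as a small lookup table. This is a minimal Python illustration, not the openshift/api implementation: `ENABLED_IN` and `is_enabled` are hypothetical names, TP and DP are assumed to mean the TechPreviewNoUpgrade and DevPreviewNoUpgrade feature sets, and the sketch assumes a gate promoted to Default on self-managed stays enabled in TP and DP there.

```python
# Hypothetical model of the gate's enablement after this PR; these are not
# the real openshift/api data structures.
GATE = "AWSServiceLBNetworkSecurityGroup"
FEATURE_SETS = ("Default", "TechPreviewNoUpgrade", "DevPreviewNoUpgrade")

# cluster profile -> feature sets in which the gate is enabled
ENABLED_IN = {
    # Promoted to Default on self-managed (assumed to remain on in TP and DP).
    "SelfManagedHA": {"Default", "TechPreviewNoUpgrade", "DevPreviewNoUpgrade"},
    # Kept in TP and DP on Hypershift until the implementation is finished.
    "Hypershift": {"TechPreviewNoUpgrade", "DevPreviewNoUpgrade"},
}


def is_enabled(cluster_profile: str, feature_set: str) -> bool:
    """Return True if the gate is on for the given profile and feature set."""
    return feature_set in ENABLED_IN.get(cluster_profile, set())


if __name__ == "__main__":
    for profile in ENABLED_IN:
        for feature_set in FEATURE_SETS:
            state = "enabled" if is_enabled(profile, feature_set) else "disabled"
            print(f"{GATE} on {profile}/{feature_set}: {state}")
```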
@mtulio mtulio force-pushed the SPLAT-2637-ga-aws-ccm-nlb-sg branch from e188bac to 98fa4f3 Compare March 9, 2026 14:10
@openshift-ci openshift-ci bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 9, 2026
@mtulio
Contributor Author

mtulio commented Mar 9, 2026

/pipeline required

@openshift-ci-robot

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aws-ovn
/test e2e-aws-ovn-hypershift
/test e2e-aws-ovn-hypershift-conformance
/test e2e-aws-ovn-techpreview
/test e2e-aws-serial-1of2
/test e2e-aws-serial-2of2
/test e2e-aws-serial-techpreview-1of2
/test e2e-aws-serial-techpreview-2of2
/test e2e-azure
/test e2e-gcp
/test e2e-upgrade
/test e2e-upgrade-out-of-change
/test minor-e2e-upgrade-minor

@mtulio
Contributor Author

mtulio commented Mar 9, 2026

/test e2e-aws-ovn-hypershift

/test e2e-aws-ovn-techpreview

/test e2e-aws-serial-1of2

@mtulio
Contributor Author

mtulio commented Mar 9, 2026

e2e-azure failed on destroy, failing to delete the resource group (RG). This is unrelated to this change. Since the job is not required and the feature is not related to that platform, I will skip the re-run.

@mtulio
Contributor Author

mtulio commented Mar 9, 2026

e2e-aws-serial-1of2 and e2e-azure are failing for unrelated reasons:

  • e2e-aws-serial-1of2: monitor test on disruption/metrics-api
  • e2e-azure: destroy flow

@mtulio
Contributor Author

mtulio commented Mar 9, 2026

Hi Joel, would you mind reviewing it please?
/assign @JoelSpeed

@JoelSpeed
Contributor

/lgtm
/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 10, 2026
@openshift-ci-robot

@JoelSpeed: This PR has been marked as verified by CI.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2026
@openshift-ci-robot

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcpowermac, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2026
@mtulio
Contributor Author

mtulio commented Mar 10, 2026

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2026
@mtulio
Contributor Author

mtulio commented Mar 10, 2026

Investigation of job e2e-aws-serial-1of2 failure by Claude 🤖


PR Analysis by Claude 🤖

Analyzing test failures for job pull-ci-openshift-api-master-e2e-aws-serial-1of2 (ID: 2031076482759004160)


Failed Tests

  • [Monitor:metrics-api-availability][sig-instrumentation] disruption/metrics-api connection/new should be available throughout the test
  • [Monitor:metrics-api-availability][sig-instrumentation] disruption/metrics-api connection/reused should be available throughout the test

Both tests failed with 31 seconds of disruption against an allowed threshold of 6 seconds, reporting 503 Service Unavailable errors starting at Mar 09 19:52:11.


Root Cause

The failures are caused by a test framework interaction between two concurrent tests:

Disruptive Test (running in foreground):

[sig-node] NoExecuteTaintManager Multiple Pods [Serial]
only evicts pods without tolerations from tainted nodes
  • Started: ~19:51:08
  • Completed: 19:53:14
  • Duration: 2m6s
  • Purpose: Validates pod eviction behavior when nodes are tainted

Monitor Test (running in background):

[Monitor:metrics-api-availability][sig-instrumentation]
disruption/metrics-api connection/new|reused
  • Runs continuously throughout the test suite
  • Measures API availability and allowed disruption windows

Event Timeline

19:51:08  NoExecuteTaintManager test begins
          ├─ Adds NoExecute taints to worker nodes
          │
19:52:09  TaintManager triggers mass pod eviction
          ├─ 23 pods evicted from 7 namespaces
          ├─ Including: thanos-querier (serves metrics-api)
          ├─              prometheus-k8s-1
          ├─              metrics-server instances
          └─              alertmanager, kube-state-metrics, etc.
          │
19:52:10  Replacement pods created and scheduled
          ├─ thanos-querier-f9488dbcd-f4pzt
          └─ metrics-server, kube-state-metrics (new instances)
          │
19:52:11  metrics-api becomes unavailable
          ├─ Old thanos-querier terminating
          ├─ New thanos-querier starting (pulling image, initializing)
          └─ Monitor tests detect: 503 Service Unavailable
          │
19:52:42  New monitoring pods become Ready
          ├─ metrics-api service restored
          │
19:53:14  NoExecuteTaintManager test completes
          └─ Taints removed from nodes
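
The 31-second disruption figure can be cross-checked against the timeline timestamps. A minimal Python sketch, assuming all timestamps fall on the same day; `gap_seconds` is a hypothetical helper name, and the values are copied from the timeline above.

```python
from datetime import datetime

# Timestamps copied from the event timeline (same day, HH:MM:SS).
UNAVAILABLE_AT = "19:52:11"  # metrics-api becomes unavailable
RESTORED_AT = "19:52:42"     # new monitoring pods become Ready


def gap_seconds(start: str, end: str) -> int:
    """Seconds elapsed between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    print(gap_seconds(UNAVAILABLE_AT, RESTORED_AT))  # 31
```

The same helper applied to the NoExecuteTaintManager start and end times (19:51:08 and 19:53:14) gives 126 seconds, matching the reported 2m6s duration.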

Technical Details

Event Breakdown (19:52:09 - 19:52:30):

  • TaintManagerEviction: 23 events
  • Killing: 46 events
  • SuccessfulCreate: 15 events
  • Started: 12 events

Affected Namespaces:

  • openshift-monitoring (11 pods)
  • openshift-cluster-storage-operator
  • openshift-dns
  • openshift-image-registry
  • openshift-ingress-canary
  • openshift-insights
  • openshift-network-console

Critical Impact:
The thanos-querier pod, which serves the metrics-api endpoint, was evicted and rescheduled. During the ~30-second gap between the old pod terminating and the new pod becoming Ready, the metrics-api endpoint returned 503 errors.

Disruption Calculation:

  • Historical P99: 0s (new connections) / 680ms (reused connections)
  • Rounded up: 1s
  • Grace period: 5s
  • Total allowed: 6s
  • Actual disruption: 31s
  • Result: Failed by 25s
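
The arithmetic in this breakdown can be reproduced with a short Python sketch. The function names are hypothetical and this is not the monitor's actual code; the 1-second floor is an assumption inferred from both connection types (P99 of 0s and 680ms) sharing the same 6-second threshold.

```python
import math

GRACE_SECONDS = 5  # grace period from the breakdown above


def allowed_disruption(p99_seconds: float, grace: int = GRACE_SECONDS) -> int:
    """Round the historical P99 up to a whole second, then add the grace period.

    Assumption: the rounded value is never below 1s, since a P99 of 0s
    also yields a 6s threshold in the report.
    """
    return max(1, math.ceil(p99_seconds)) + grace


def overshoot(actual_seconds: int, allowed_seconds: int) -> int:
    """Seconds by which the observed disruption exceeds the allowance."""
    return actual_seconds - allowed_seconds


if __name__ == "__main__":
    allowed = allowed_disruption(0.68)     # reused connections, P99 = 680ms
    print(allowed, overshoot(31, allowed))  # 6 25
```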

Test Framework Issue

This represents a known conflict in the test suite design:

  1. Serial disruptive tests (like NoExecuteTaintManager) intentionally cause node-level disruptions to validate cluster behavior
  2. Background monitor tests continuously measure API availability across all operations
  3. Result: Legitimate test-induced disruptions are flagged as failures by the monitors

The NoExecuteTaintManager test is working correctly - it successfully validated that pods without tolerations are evicted when nodes receive NoExecute taints. The monitoring stack pods (thanos-querier, prometheus, etc.) correctly do not tolerate arbitrary test taints and were properly evicted.


Analysis Artifacts

Full analysis available in the investigation artifacts:

  • Event correlation and pod eviction details
  • Node status verification (all nodes Ready, no persistent taints)
  • Monitoring pod lifecycle during the disruption window

@mtulio
Contributor Author

mtulio commented Mar 10, 2026

/retest-required

@mtulio
Contributor Author

mtulio commented Mar 10, 2026

context deadline exceeded (Client.Timeout exceeded while awaiting headers)

^ verify-feature-promotion timed out while calling Sippy.

/test verify-feature-promotion

@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

@mtulio: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 1f950f7 into openshift:master Mar 10, 2026
28 checks passed
